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Abstract 

This paper considers the inference of trends in multiple, nonstationary time series. 
£>-j To test whether trends are parallel to each other, we use a parallelism index based 

on the L 2 -distances between nonparametric trend estimators and their average. A 
■^J- central limit theorem is obtained for the test statistic and the test's consistency is 

established. We propose a simulation-based approximation to the distribution of the 
test statistic, which significantly improves upon the normal approximation. The test 
is also applied to devise a clustering algorithm. Finally, the finite-sample properties 
of the test are assessed through simulations and the test methodology is illustrated 
with time series from Motorola cell phone activity in the United States. 

(N 
> 

in 

££j 1 Introduction 

Comparison of trends or regression curves is a common problem in applied sciences. For 
example in longitudinal clinical studies, evaluators are interested in comparing response 
curves for treatment and control groups. In agriculture, it may be relevant to compare 
at different spatial locations the relationship between yield per plant and plant density 
(Young and Bowman, 1995). In biology, assessing parallelism between sets of dose-response 
data allows to determine if the biological response to two substances is similar or if two 
different biological environments give similar dose-response curves to the same substance 
(Gottschalk and Dunn, 2005). Also in economics, a standard problem is to compare the 
yield over time of US Treasury bills at different maturities, or the evolution of long-term 
rates in several countries (Park et al., 2009). 
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The statistical methodology developed in this paper is motivated by a collection of 
time series of cell phone download activity (applications, audio, images, ringtones, and 
wall papers) collected by Motorola in the United States between September 2005 and June 
2006. The measurements were collected hourly and aggregated at the area code level (129 
area codes were observed in total). A question of considerable interest is to determine 
whether the download trends in the area codes are identical up to scale differences. (Scale 
differences can be expected because of the differences in numbers of phone users for each 
area code.) If this hypothesis was true, it could be asserted that for those area codes that 
display slower growth rates than average, their growth deficit is non-structural. By placing 
more advertising efforts and commercial incentives in these areas, the phone company and 
its commercial partners could thus expect cell phone downloads to increase. Another 
interesting application of comparing trends in cell phone activity pertains to the allocation 
of bandwidth in phone networks. 

After a pilot study revealed a multiplicative structure in the trend, seasonality, and 
irregularities of the time series, a logarithmic transform was applied to the data so as to 
stabilize the variance and obtain an additive (signal + noise) model. In this context, an 
efficient way to test for proportionality between the trends in the initial data is to consider 
the alternative problem of testing for parallelism between the trends in the log-transformed 
time series. From here on, we consider an additive nonparametric time series model that, 
in our analysis of the Motorola data, pertains to the log-transformed data. Suppose that 
we observe N time series {X it }f =1 , i = 1, . . . , N, according to the model 



where the /ij are unknown smooth regression functions defined over [0, 1] and the {eu}J = i 
are mean zero error processes. The scaling device t/T in ([I]) indicates that the means E,Xu 
change smoothly in time, due to the smoothness of /ij(-). It is widely used in statistics 
and econometrics; see for example Orbe et al. (2005) and Wu and Zhao (2007). We are 
interested in testing whether the fii, i = 1, . . . , N, are parallel, namely, whether there exists 



X it = Hi{t/T) + e it , t = 1, . . . , T, 
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a function /x and numbers q such that 

H : fn(u) = a + fi(u), i = 1,...,N, u e [0,1]. (2) 

Under if , the q represent the vertical shifts between the curves m and the reference curve 
ji. They can be viewed as nuisance parameters for testing purposes. Note that testing 
for parallelism is closely related to testing for equality as, on the one hand, H is formally 
equivalent to equality of the N centered functions (/Jj — Jq Hi{u)du) and, on the other hand, 
the scalars fii(u)du can be considered as known since they are estimated at parametric 
rates while the functions /ij and /i are estimated at slower nonparametric rates. 

Various tests for comparing mean functions can be found in the regression litera- 
ture. Hardle and Marron (1990) compare two nonparametric regression curves by test- 
ing whether one of them is a parametric transformation of the other. To test equality of 
N = 2 regression curves in the setup of independent errors, Hall and Hart (1990) propose 
a bootstrap test, King et al. (1991) devise a procedure based on the L 2 -distance between 
kernel regression estimators, and Guo and Oyet (2009) apply a wavelet-based method. 
For N > 2, assuming independent errors, Munk and Dette (1998) use a test based on 
weighted L 2 -distances that requires no smoothing parameter selection. To test whether 
a nonparametric mean curve has a certain parametric shape, Bissantz et al. (2005) and 
Pawlak and Stadtmiiller (2007) appeal to signal processing theory and the Whittaker- 
Shannon sampling theorem under independent errors while Degras (2010) utilizes approx- 
imate simultaneous confidence bands in the functional data setup. Under model and 
design conditions, their tests can be adapted to assess parallelism for two mean curves. 
Young and Bowman (1995) build AN OVA- type tests for equality and parallelism in k > 2 
regression curves under i.i.d. errors. In the time series setup, to infer equality of two 
trends, Park et al. (2009) apply a graphical device assuming stationary, weakly correlated 
errors; Li (2006) builds a test based on the cumulative regression functions, assuming 
long-memory moving average errors; Fan and Lin (1998) use an adaptive Neyman test 
with stationary Gaussian linear error processes. For random designs of observations, con- 
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tributions to the comparison of regression curves include Delgado (1993), Koul and Schick 
(1997), and Lavergne (2001). 

The present work brings several contributions to the statistical problem of testing 
parallelism between trends in multiple time series. First, studies to date are based on one 
or both of the following assumptions: (i) the error processes {ea}f = i in are independent 
in time, or more generally stationary; (ii) the number N of time series is fixed and usually 
small. In this paper we relax both assumptions: the {eu}f = i can be non-stationary and N 
can be arbitrarily large. We describe the dependence of the {eu}f = i in terms of the physical 
dependence model of Wu (2005), which represents errors as being generated by series of 
i.i.d. innovations. The data-generating mechanism may be nonlinear with respect to the 
innovation process and may vary smoothly over time. This non-stationary dependence is 
realistic in practice and it generalizes the parametric or stationarity assumptions of the 
literature. Second, we devise a method of independent interest to estimate consistently the 
long-run variance function of a locally stationary time series. Third, we exploit a strong 
invariance principle to build a simulation-based method that approximates the finite sample 
distribution of the test statistic. The resulting approximation is more accurate than the 
limiting normal distribution and its implementation is faster than bootstrap alternatives. 
Fourth, we apply the test to an iterative clustering algorithm that groups time series 
according to the parallelism of their trends. The algorithm has the nice feature that it 
does not require to pre-specify in how many clusters the data will be grouped. Time series 
that are very different from all others may form a group of their own. For this reason, 
the algorithm provides valuable insights in the data that complement standard approaches 
like fc-means clustering. Another attractive by-product of the clustering algorithm is that 
it readily provides significance levels for all clusters found. 

The rest of the paper is organized as follows. Section [2] presents a test statistic based 
on the L 2 -distances between the estimators of the individual trends (/ij — q) and the 
estimator of the global trend \x in (|2]). The test statistic estimates a parallelism index. 
Its asymptotic properties are discussed in Section [3] for both fixed N and N — > oo. A 
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central limit theorem is derived under ([2]) and the test is shown to be consistent against 
local alternatives. Section [4] deals with the test implementation and provides methods 
for bandwidth selection and long-run variance estimation, as well as a simulation-based 
method to approximate the finite-sample distribution of the test statistic. Simulations are 
carried out in Section [5] to assess the empirical significance level and statistical power of 
the test procedure. The clustering algorithm is described in Section [6] and illustrated in 
Section [7] with the Motorola data. Proofs of the main results are deferred to the Appendix. 



2 Test statistic 

To ensure model identifiability under the null hypothesis H in ([2]), we assume that 

N 

5> =0 - ( 3 ) 

i=l 

A natural way to test Hq is to compare the curves /tj estimated under the general model 
([I]) to the curves q + pt, estimated under H . To estimate the common trend [i under H , 
we can use the averaged process X. t = J2iLi Xu/N for t = 1, . . . , T: 

X. t = ti(t/T) + e. t . (4) 

Similarly, define Xj. = Ylt=iXit/T, X.., e. t and e^.. In this paper we adopt the popular 
local linear smoothing procedure (Fan and Gijbels, 1996) to estimate the trends. Let K 
be a Lipschitz continuous, bounded, symmetric kernel function with support [—1,1] and 
satisfies K(u)du = 1; let b > be the bandwidth. Then the local linear estimator of 
is 

T 

jX(u) = £ Wb (t, u)X t , < u < 1, (5) 
t=i 

with the weights Wb defined by 

U \ is(( * /rr\ /u\ Sb ^ ~( u ~ t/T)S btl (u) 
w b (t,u) = K((u-t T) b) , 6 
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where 

T 



S&» = J> - t/TyK((u - t/T)/b), u e [0, 1]. (7) 
t=i 

Let (flo, $i) be the minimizer of the weighted sum 

T 

£(X t -P Q -h{u- t/T)) 2 K((u - t/T)/b). 



t=i 



Then /t(w) = /3o, and Fan and Gijbels (1996) argued that this local linear estimate has a 
nice boundary behavior. For simplicity of the procedure, it is advantageous to estimate \ii 
with the same bandwidth used for /x. This also simplifies mathematical derivations (see 



Section 4.1). In this case, the local linear estimate for /jj is 

T 

^ii u ) = y^ J w b (t,u)X it . (8) 
t=l 

The intercepts c, are estimated by 

h = ^"tm/T)-W/T)]. (9) 

t=l 

Since the same bandwidth b is used in ^ and ([8]), we have the interesting observation 
that fi{u) = iV -1 YjiLifai{ u )' Therefore, the q naturally satisfy the constraint 

There are many ways to measure the distance between the curves Cj + /}(•) and //*(■). 
In this paper we adopt the L 2 -distance 

N rl 

(u) — di — (i(u)) 2 du. (10) 



E/ (Ai 

i=i - 70 



Clearly Ajv,t is a natural estimate for the parallelism index 

Ajv = jp^ m c / — Ci — [i{u)) 2 du, (11) 

where the explicit solutions are fx(u) = ^2^ =1 ^i(u)/N and c* = L {ni(u) — jj,{u))du. 



3 Asymptotic theory 



Here we shall discuss limiting distribution and consistency of the test. In our framework we 
allow both N and T to go to infinity, and the error processes {eu}J = i can be non-stationary. 
To establish the asymptotic normality of Ajv,t> we impose structural conditions on the error 
processes {e it }J =1 , following the ideas of Wu (2005). More specifically, we assume that the 
{eu}f =1 , i = 1, . . . , N, are i.i.d. as a process {e t }f =1 of the form 

et = G(t/T;F t ), (12) 
where Tt = (. . . ,e t -i,e t ), {sj}j £ z is an innovation process with i.i.d. elements, and G(-\ •) 



is a measurable function. Equation (12) can be interpreted as an input/output physical 
system where the {£j}j-=_ 00 are the inputs and e t is the output. Assuming that G(u;J-f) 
has a finite p-th moment for some p > 0, define the physical dependence measure 

5 p (t)= sup \\G(u;T t )-G(u;^)\\ p , (13) 

0<M<1 

where T[ — (. . . , e_i,ed,£i, . . . ,e t ) and e' is a random variable such that e' , e t , t G Z, 
are i.i.d. The index S p (t) quantifies the dependence of the output e t on the inputs T t by 
measuring the distance between G J (-;J r t ) and its coupled version G(-; J 7 /). Furthermore, 
assume that G(u;J-'t) is stochastically Lipschitz continuous (SLC), that is, there exists a 
constant C such that 

\\G{ Ul] T t ) -G(u 2 ;F t )\\ p < C\u x - u 2 \ (14) 

for all U\,ii2 G [0, 1], which we denote by G G SLC. This models the non-stationarity in 
which the underlying data generating mechanism changes smoothly over time. Note that 
the {eu}f = i can be represented in the following manner: let e^, i — 1, . . . , N, k G Z, be 
i.i.d. random variables; let Tn = (. . . , £i,t_i, £it), then 

e u = G{t/T>,T it ). (15) 



7 



Assuming that Eefc = for all fceZ, let 

-y k {u)=E[G{u;F k )G(u;F )], < u < 1. (16) 
Define the long-run variance function 

= 5>*( u ) (17) 

and its squared integral 

a 2 = g 2 (u)du. (18) 



Recall that the kernel function K is Lipschitz continuous on its support [—1, 1]. Let 

J.1-2M rl 

K*(x) = J K(v)K(v + 2\x\)dv and K* 2 = J (K*(v)) 2 dv. 
We have the following result. 

Theorem 1. Let N = N(T) be such that either (i) N — > oo as T — > oo or (ii) N is fixed. 
Let b = b(T) be a bandwidth sequence such that Tb 3/2 — > oo and b — > 0. Further assume 
that G G SLC and that, for some p > 4, the following short-range dependence condition 
holds: 

oo 

^<5 p (t)<oo. (19) 

t=0 

Then under the null hypothesis H , we have 

Tb 1/2 {N - 1)- 1/2 {K N , T - EAjv,t) A N(0, o 2 Kl). (20) 



It is worth observing that the limit distribution in (20) is the same whether (i) iV — > oo 
or (ii) iV = 0(1). However, the proofs for these two cases are different; see Section 9.1 in 
the Appendix. Here we provide intuitions of the proofs. If N — > oo, the estimates Cj and fi 
will both be close to their true values. Hence the L{jjLi{u) — Q — jl{u)) 2 du, i — 1, . . . , N, in 
(10) can be approximated by the f Q (jj,i(u) —Ci — [i(u)) 2 du, which are i.i.d., and the classical 
Lindeberg- Feller Central Limit Theorem (CLT) applies. In case (ii), the Lindeberg- Feller 



CLT is no longer applicable since N = N(T) is bounded; however, we can apply the in- 



dependent and martingale approximations as in Liu and Wu (2010) and still obtain (20). 



Note that the factor (N — 1) in (20) is due to the fact that we average the N independent 
streams to get the function estimate /}, thus losing one degree of freedom. 

We now look into test consistency. Recall that Ajy,T serves as an estimate of the 
parallelism index A^r defined in (11) under both H in ^ and alternatives. Our test 
rejects H at level a if A^,t exceeds the (1 — a) quantile of its distribution. (The precise 
implementation of the test is provided in Section]!]). The next theorem asserts that this test 
is consistent against local alternatives approaching ^ such that ./V~ 1 (T& + b~ 2 )Aw — > oo, 
namely under the latter condition the power goes to 1. 

Theorem 2. Assume conditions of Theorem^ Also assume that the fa, i — 1, . . . , N, in 
([!]) have uniformly bounded second derivatives on [0, 1] . Then the parallelism test based on 
Ajv,t has unit asymptotic power if N~ l (Tb + 5 _2 )Aat — > oo. 



4 Test implementation 

We address here the implementation of Theorem [T] for hypothesis testing. In particular, 
we discuss the issues of bandwidth selection and variance estimation, and we propose a 
simulation-based procedure that improves upon the normal approximation for the test 
statistic Ajv,t- 

4.1 Bandwidth selection 

As seen in Section [2j the same bandwidth b is used in the test procedure to estimate 
both /j, and the fa, i = 1, . . . , N. In addition to simplifying the test implementation and 
theoretical study, this choice automatically corrects biases under H as noted by Hardle 
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and Mammen (1993): 

T T 

t=l t=l 
= Ci. (21) 

To select the bandwidth b, we propose a generalized cross-validation (GCV) procedure 
that can adjust for the dependence of the time series. The simulation study of Section 
[5] suggests that our test procedure is reasonably robust to the choice of b and the GCV 



method (22) performs reasonably well. Since our test procedure aims at reconstructing 
the mean function differences \L{ — fi, i — 1, . . . , N, and assess whether they are constant 
over time, it is natural to base the GCV score on the Yj = {Xu — X. t }J =l rather than on 
the original time series Xj = {X it }f =1 . Let T = ('jt,t')i<t,t'<T, where 7 t)t / = E(e t e t >), be the 
covariance matrix of the error process and let H(6) be the T x T "hat" matrix associated 
to the local linear smoother with bandwidth b. Denoting by Yj = H(6)Yj the estimator 
of \ii — /i at the design points, we propose to choose b by minimizing the GCV score 



We now consider the estimation of the covariance matrix T = (7t,t')i<t,i'<T- Due to the 
local stationarity of en, we use the local linear smoothing (Fan and Gijbels, 1996) technique 
and naturally estimate Jt,t+ki < k < T, by 

N T-k 

.V 



z ^^2^2^i,v+kW b {v/(T - k),t/{T - k)}, (23) 

1=1 v = l 

where Wb(t, u) are the local linear weights defined by ^ with T therein replaced by T — k, 
and &i V = X iv — fii(v/T), % = 1, . . . , N, v = 1, . . . , T, are the estimated residuals. Since j^v 
is small if \t — 1'\ is large, using the regularization method of banding (Bickel and Levina, 
2008), we estimate T by (7 tl t//{| t _ t /|< T 4/is } )i< t , t '<T- 
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4.2 Estimation of the long-run variance function 

In order to apply Theorem [TJ we need to estimate the critical quantity cr 2 in (18) which 



serves as the asymptotic variance (up to a known scalar) of the test statistic (10), or more 
essentially we need to estimate the long-run variance function g. For each u G [0, 1], let 



u 



{t : \t/T-u\ < r}, 



(24) 



where r = t(T) is a window size satisfying r — > and Tr — > oo as T — > oo. The points of 
J\f T (u), suitably rescaled by 1/T, become increasingly dense in [u — r, u + t] as T — > oo. By 



the local stationarity (14), the process {eit]t&M T (u) can be approximated by the stationary 



process (G(u, J~it))t&N T {u) i n the sense that 



sup max \\e it - G(u,jFit)\\ p 

0<«<1 t&N T {u) 



O(t). 



(25) 



Denote by ^(m) the sample auto-covariance of {eit\t^N T {u) at lag k and average these 



quantities over % to estimate the auto-covariance ( 16 ) by 

1 N 



lk{u) 



(26) 



Then g(w) can be simply estimated by 



9W 



K T 
k=-K T 



(27) 



for some truncation parameter Kt = [Ttq\ with bandwidth g — > and Tr^ — > oo. Indeed, 
7fc(w) will be close to zero for large k and for all u G [0, 1] under the local stationarity 



condition (14) and the short-range dependence assumption (19). More precisely, we need 



to specify the decay rate of the physical dependence measure (13) to characterize the bias 
caused by truncation. Also, the error processes {e^}, i — 1, . . . , N, are not observable in 



practice and we recommend plugging the residuals e it = X it — jl^t/T) from rt8J) into (27) 



to get an estimate g of the long-run variance function. The following theorem provides 
error bounds for both g and g. 
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Theorem 3. Assume that G e SLC, g e C 2 [0,1], Et=cA(t) < °°> and Et=T^(t) = 
0{T~ a ) for some a > 0. Then 

sup \\g(u) - g(u)\\ 2 = O (^Q~Jn + (Trg)- a + (rp)^< 1+a > + r 2 + g) . (28) 
ue[o,i] v 7 

// m addition i = (Ttq) 1 I 2 {o 2 + T^fe- 1 / 2 ) ->• ; we /mve 

sup \\~g{u) - g{u) || 2 = O L + + (Tr^)- Q + (rp) a /( 1+a ) + r 2 + <?) . (29) 

uS [0,1] v J 

The choice of banding parameters r and g that minimize the bound on the right hand 



side of (28) can depend on N, T and a in a highly complicated fashion. Nevertheless, 



when a > 2 we have the following dichotomy: 



If N > T 2a / (3a+2 \ the optimal bound in (|28j is 0(T- 2a ^ 3a+2 ^) for r x T~ Q /( 3a+2 ) 
and £xT- 2a /( 3Q+2 ); 

If N < T 2Q /( 3a+2 ) in which case N is not required to blow up, the optimal bound in 



(28) is C((TiV)- 2Q /( 5Q+2 )) for r x (TiV)- Q /( 5 «+ 2 ) and g x T" 4a /( 5a+2 )iV( Q + 2 )/( 5a + 2 ). 



In particular when the errors satisfy the geometric moment contraction condition, that is, 
62(h) decays geometrically quickly as in the case of an autoregressive process, the optimal 



bound for (28) is 0(T~ 2 / 3 logT) if N/T 2 ' 3 00 and C(T~ 2 / 5 logT) otherwise. 



Note that the bound in (29) goes to zero at a slower rate than the one in (28) and 



reaches 0(T 2//5 logT) when the geometric moment contraction condition is satisfied. 

4.3 Simulation- based approximation to the distribution of the 
test statistic 



The normal convergence in (20) can be quite slow. A popular way to improve the conver- 
gence speed is via bootstrap; see for example Hall and Hart (1990) and Vilar-Fernandez 
et al. (2007). Here we propose an alternative simulation-based method, which is easily 
implementable and has a better finite-sample performance. 
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Let Zn, i — 1, . . . , N, t — 1, . . . , T, be i.i.d. standard normal random variables. If the 
long-run variance function g is known, let X? t = g(t/T)Z it and otherwise, use the estimate 
g to define Xf t = g{t/T)Z it . Let A^ T be the test statistic associated to the Xf t , assuming 
that q = 0. By Theorem [lj Ajv,t and have the same asymptotic distribution under 
the assumptions of Theorems [l] and ij Hence, the distribution of Ajv.t can be assessed by 
simulating A^ T . Specifically, one can generate many realizations of (Xf t )J =1 , i = 1, . . .N, 
and compute the corresponding A^r T from which one can obtain the estimated (1 — ct)-th 
quantile q\- a - Based on this, one can reject at level a the null hypothesis if An,t > Qi-a, 
and accept otherwise. The validity of this method is guaranteed by the invariance principle 
(see Wu and Zhou (2011)) which asserts that partial sums of dependent random vectors 
can be approximated by Gaussian processes. 

5 Simulation study 

5.1 Acceptance Probabilities 

In this section we present a simulation study to assess the performance of our test proce- 
dure. Consider the model 

X lt = Ci + /i(t/T) + e it , (30) 



with Cj = and /z(w) = 2sin(27ra). Note that under (30), the test procedure is independent 
of the Cj. The error process {e it } is generated by e it = Q )t (t/T), where for all i 6 Z and 
u G [0, 1], the process (Ci,t( u ))tez follows the recursion 

d,t( u ) = p( u )Ct,t-i(u) + ae itt , (31) 

with the Eij being i.i.d. random variables satisfying = —1) = P(£j,t = 1) = 1/2. 

Thus, {e it }t£Z for i = 1, . . . , iV are i.i.d. AR(1) processes with time-varying coefficients. 
Let p(u) = 0.2 — 0.3m and a = 1. Easy calculations show that E,(Q tt (u)) = 0, Var((^(u)) = 
er 2 /(l — p(u) 2 ) and the long-run variance function g[u) = cr 2 /(l — p(u)) 2 . 
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In our simulation the Epanechnikov kernel K(v) = 3max(0, 1 — t> 2 )/4 is used. We 



simulate 10,000 realizations of (31) and, for each realization, 10,000 simulations of A^ r 
are performed as in Section 4.3 We are interested in the proportion of realizations for 
which the null hypothesis is correctly accepted. Acceptance probabilities are presented in 
Table [l]for different choices of T, N and b. This suggests that the acceptance probabilities 
are reasonably close to the 95% nominal levels and become more robust to the size of 
bandwidth as the sample size gets bigger. 



Table 1: Acceptance probabilities at 95% nominal levels with different T, N and b. 
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5.2 Statistical power 



In the setting of Section |5\T| with T = 300 and N = 100, we study the statistical power of 
our testing procedure. For a certain proportion (say p) of the N time series {Xu}f =l for 



i E {1, . . . , N}, we add a distortion afMd(t/T) in addition to (30), where fJ,d(u) = 2 cos(27ra) 
and a denotes the corresponding magnitude. We investigate on the rejection probabilities 
at 5% nominal levels with different choices of p and a and the results are summarized in 
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Figure [Tj It can be easily seen that the power goes to one very quickly as the magnitude 
of distortion a gets large and the proportion of different trends p approaches 0.5. 



6 Application of the test to clustering 

The test procedure developed in the previous sections can be applied to cluster collections 
of time series based on their similarity in terms of parallelism. In the sequel we identify the 
time series in (JI]) with their indexes. To build our iterative clustering algorithm, we start 
by finding the largest cluster G\ in U^ Q ' = {1, . . . , N} for which the parallelism assumption 
Ho is retained at level a. The cluster G\ is obtained by progressively removing from the 



analysis the sample units that contribute most to the test statistic (10). Specifically, H 
is first tested on U^, then on a subset C if rejected on U^, and so on so forth 
until Ho is accepted or is reduced to a single element for some k, in which case the 
algorithm ends without any cluster being found. At the second iteration, the procedure is 
repeated with the remaining time series (set := {1, . . . , iV} \ Gi) and so on so forth 
until either all time series are clustered (i.e. {1, . . . , N} = G\ U • • • U Gl for some L) or no 
more clusters of size > 1 can be formed. 

We now give a precise description of the algorithm. For the test implementation, 
the user must provide a significance level a, bandwidth b (cf. Sections [2] and 4.1), and 



parameters (r, g) (cf. Section 4.2). The user must also specify the number n of sample 
units to remove at each step of the cluster search. As long as n is small (say n — 1 for small 
iV and, say n < 5iV/100 = iV/20 for moderate to large N), this tuning parameter does 
not affect the outcome of the clustering algorithm; it however influences the computational 
time. Note that when moving from working index set to during the cluster 

search, the algorithm removes at least one and at most (iV fe — 2) sample units from U^ k \ 
where is the size of U^ k \ so that there remains at least two units to compare at the next 
step. As a result the effective number of removed units is n* = max(l, min(n, Nk — 2)). 
Also, if H is rejected on and accepted on U^ h+1 \ this may mean that too many units 



15 



(n* of them) have been removed from and that H can be retained on an intermediate 
set £/( fe+1 ) C U' C U^ k \ In this case a flag F is activated and the algorithm starts a 
dichotomic search, returning to the previous working index set and attempting to 
remove less units at each subsequent step (i.e. roughly n/2, then n/4, etc.). The following 
notations are needed for the formal statement of the algorithm: 

* k: step counter; /: group counter, F: flag. 



* Xf = N^J2 ieU(k) X it , X ik) = r^Llf, V {k \u) = and 
cf } = T- 1 J2ieu(V {fait/?) ~ fi {k) (t/T)). Recall that N k is the size of U™. 



The algorithm works as follows: 



Initialization. 



1. Set U<® :={!,..., N}, k := 0, I := 1, and F := 0. 



2. Initialize the parameters a, b, r, g, and n, with n < N . 



3. Perform the parallelism test on {Xi t }^ =l for i G C/(°) and compute the p-value. 



(a) Case p > a. 

• Compute the A; := J" 1 {/tj(-u) — cf 1 — fi^ (w)} 2 du for i G £7^ and sort them 
as A ct( i) < • • < A^jv,,). 

• Write n* := max(l, min(n, N — 2)) and set 



U^:=U^\{a(l),..., 



cr(n*)} and k := 1. 



(b) Case p < a 



• Set G 



i 



C/(°) and stop the algorithm. 



Determination of the cluster G;. 
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4. If N k = 1, then stop the algorithm. 

5. Perform the parallelism test on the {Xu}f =1 for i G and compute the p-value. 

(a) Case p > a. 

• Compute the A, := J* {fii(u) — cf* — fi^ (u)} 2 du for % G and sort 
them as A ct(1) < • • • < A CT(7Vfe ). 

• If F — 1, set n : = max(l, |_n/2_|). 

• Write n* := max(l, min(n, N k — 2)) and set 

Uik+i) .-uW\{a(l),...,a( n *)} and k := k + 1. 

• Return to step 4. 

(b) Case p < a. 

• Set F := 1. 

(i) Case n — 1. 

• Set G, := 

• If {1, . . . , N} = (d U • • • U Gi), then stop the algorithm . 
Else return to step 1 and set 

[/(°) {l,...,JV}\(G 1 U--UG i ), fc:=0, and I := I + 1. 

(ii) Case n > 1. 

• Set 

n:=max(l, Ln/2J) and f/ (fc) := \ Ml), . . . , cr(n*)}, 

where n* := max(l, min(n, A^_i — 2)). 

• Return to step 4. 

The R implementation of the algorithm can be obtained from the authors upon request. 
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7 Analysis of Motorola data 



To illustrate our parallelism test and clustering procedure, we consider a data set of hourly 
volumes of downloads from cell phones (in byte) in 129 U.S. area codes (24 area codes 
are in Center America, 87 in Eastern America, 1 in Hawaii, and 24 in Pacific America). 
Rather than studying the original data, we look into their daily sums so as to remove daily 
periodicity (which produces long-range dependence). Since the area codes have different 
numbers of phone users, we also apply a logarithmic transform (base 10) to the data to 
adjust for the scale effect. Thus, multiplicative differences in the time series become addi- 
tive, which makes it relevant to test for parallelism in the trends of area codes. Examples 
of time series as well as the estimated global trend function /i = N' 1 Y^iLi an d long-run 
variance function g are displayed in Figure [2] 

Prior to statistical analysis, the validity conditions of our theoretical results have been 
verified on the data set. In particular, the rapid decrease observed in the autocorrelation 
functions of the detrended time series (see Figure |3]) indicates that the short-range de- 



pendence assumption (19) is very plausible. Also global similarity in the autocovariance 
functions across the time series make the assumption of identically distributed error pro- 
cesses look reasonable. Finally, the 129 x 129 cross-correlation matrix of the residual time 
series has nearly zero entries outside its diagonal, which suggests that the time series are 
independent. 

We now describe the implementation of the test and clustering algorithm on the data 
set. The significance level of the test is set to 5%. The local linear estimation of the trends is 
based on a bandwidth b and a truncated standard Gaussian density. The inverse covariance 



matrix T _1 is estimated as in Section 4T Specifically, after a pilot trend estimation using a 
bandwidth 6 = 5 days, the "banding the inverse covariance matrix" technique (e.g. Bickel 
and Levina (2008)) has been applied to the sample covariance matrix of residuals with a 
banding parameter k = 6 days. The final bandwidth b is obtained by minimizing the GCV 



score (22). The long-run variance function g is estimated as in Section 4.2 based on the 
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residuals of a local linear smoothing with bandwidth b = 10. (A larger b is used to estimate 
the long-run variance function than for the trend estimation so as to make the estimate 
less sensitive to extreme observations.) The parameters r = 0.04 and g = 0.31 are chosen 
so that the estimate of g(u),u G [0, 1] utilizes about 2 weeks of data before and after a time 
point u and the autocovariances are truncated at lag = 4. These parameter values are 
based on the visual inspection of the autocovariance plots. Finally the number n of units 
to remove at each step of the cluster search (see Section |6| is set to 3, a good compromise 
between search accuracy and computational speed of the algorithm. 

The results of the analysis are presented in Tables [2] and 3. Note that performing 
the parallelism test on the entire data set resulted in a p- value < 10~ 16 . Shifting our 
focus to clustering the time series, we observe that the four largest clusters found contain 
respectively 29, 21, 20, and 10 area codes. This alone represents a sizable proportion (62%) 
of the 129 area codes under study. These clusters are displayed in Figure |4| where their 
homogeneity can be observed. The examination of Table 3 reveals that there is no obvious 
spatial pattern in the clusters. It also shows that there is no systematic relation between 
the size of a cluster and its associated p-value. Overall, our statistical analysis shows that 
most area codes under study can be classified in a small number of profiles, or clusters, 
according to the parallelism patterns in their phone download activity. These strong 
similarities across area codes would deserve to be investigated in more detail as potentially, 
they could be exploited by phone companies to e.g. better target their marketing strategies 
or improve the bandwidth allocation. 



Cluster size 


29 


21 


20 


10 


7 


5 


3 


2 


1 


# clusters 


1 


1 


1 


1 


1 


1 


2 


7 


17 


cum. prop, of iV 


22% 


39% 


54% 


62% 


67% 


71% 


76% 


87% 


100% 



Table 2: Summary of the clusters. The clusters are maximal sets of area codes for which 
the parallelism assumption is retained at the significance level 5%. 
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Cluster 1: size 29, p- value 0.095 



code 


state 


code 


state 


code 


state 


code 


state 


code 


state 


code 


state 


203 


CT 


321 


FL 


517 


MI 


617 


MA 


810 


MI 


909 


CA 


219 


IN 


323 


CA 


540 


VA 


619 


CA 


813 


FL 


941 


FL 


239 


FL 


484 


PA 


562 


CA 


646 


NY 


815 


IL 


951 


CA 


301 


MD 


508 


MA 


603 


NH 


661 


CA 


856 


NJ 


978 


MA 


302 


DE 


513 


OH 


616 


MI 


703 


VA 


859 


KY 












Cluster 2: 


size 21, p-value 0.064 








code 


state 


code 


state 


code 


state 


code 


state 


code 


state 


code 


state 


209 


CA 


586 


MI 


630 


IL 


732 


NJ 


781 


MA 


908 


NJ 


240 


MD 


609 


NJ 


631 


NY 


734 


MI 


786 


FL 






510 


CA 


610 


PA 


708 


IL 


772 


FL 


818 


CA 






561 


FL 


626 


CA 


714 


CA 


774 


MA 


845 


NY 












Cluster 3: 


size 20, p-value 0.151 










code 


state 


code 


state 


code 


state 


code 


state 


code 


state 






231 


MI 


404 


GA 


512 


TX 


803 


SC 


904 


FL 






269 


MI 


407 


FL 


704 


NC 


816 


MO 


919 


NC 






352 


FL 


412 


PA 


740 


OH 


863 


FL 


937 


OH 






386 


FL 


419 


OH 


773 


IL 


864 


SC 


989 


MI 










Cluster 4: 


size 10, p-value 0.071 










code 


state 


code 


state 


code 


state 


code 


state 


code 


state 






248 


MI 


570 


PA 


805 


CA 


847 


IL 


916 


CA 






516 


NY 


571 


VA 


808 


HI 


914 


NY 


917 


NY 





Table 3: Spatial locations of clusters. Only clusters of size s > 10 are displayed. 
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8 Conclusion 



In this paper we have presented a test methodology for assessing the parallelism between 
trends of multiple time series. The physical dependence structure considered here allows 
to flexibly model nonstationary time series without having to specify some generating 
mechanism or autocovariance function. A method for estimating the long-run variance 
function of locally stationary processes and a simulation-based device to approximate the 
distribution of statistics based on smoothed time series have been developed as by-products 
of the test methodology. Both these tools have shown good numerical performances in our 
simulations. They are of independent interest and could be used with profit in other 
statistical problems. A key assumption used to derive the theory of this paper is that the 
observed time series are independent from one another. A very interesting extension would 
be to allow for some form of dependence, e.g. to handle spatio-temporal data. 

The paper also proposes an innovative method to cluster time series according to their 
parallelism properties. This method has at least two attractive features: first, it does not 
require to prespecify the number of clusters to be found, which guarantees the homogeneity 
of the clusters and allows atypical time series to be set apart; second, it readily provides 
significance levels for each cluster, thereby giving a quantitative sense of how strong the 
parallelism assumption holds. The implementation of this clustering method has given 
meaningful results with the Motorola time series. The algorithm is computationally fast as 
the individual trend functions and long-variance functions need being estimated only once, 
while the most computationally intensive step (Gaussian process simulation to approximate 
the distribution of the test statistic) is still manageable. The ideas harnessed in this 
algorithm (greedy search, clustering based on individual contribution of sample units to 
test statistic) can be used to cluster time series according to other similarity measures than 
parallelism. An interesting direction of future research would be to compare the results of 
this type of clustering to the more conventional /c-means or hierarchical approaches. 
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9 Appendix 

In the proofs we use C to denote a constant whose value may vary from place to place. It 
does not depend on N and T. 

9.1 Proof of Theorem [U 

The techniques for handling Case (i) with large N and Case (ii) with fixed N are different. 
For the former we apply the traditional Lindeberg- Feller CLT, while for the latter, we apply 



the m-dependence and martingale approximation techniques. For details see Sections 9.1.1 



and |9.1.2| respectively. 

We start by showing that under H , the test statistic A NjT does not depend upon //(•) 
nor the q. To see this, introduce the weight averages 

T 

w b (t) = T~ 1 Y,Mt,]/T) 
With Q, g, and @, we easily see that 

T 

fii{u) - fi{u) ~Ci = ^2 i w b{t, u) - w b (t)) {ci + e it - e. t ) 
t=i 

T 

= ^2 {^(^ u ) ~ ™b{t)} [en - e. t ) . 
t=i 

The last equality stems from the fact that Ylt=i l w b(t, u) — Wb(t)] = 1 — 1 = by the well 
known property that the weight functions u>&(£, •), t — 1, . . . , T, of the local linear smoother 
sum up to one. 

9.1.1 Case (i): iV — > oo 



We shall prove the asymptotically equivalent form of ( 20 ) 



Tb l/2 N- 1/2 (A N>T - EAjv iT ) 4 N(0, a 2 K* 2 ). (32) 



22 



To this end, we use the decomposition 

N 



where 



A NtT - EAtv,t = - EA) - - Ei^) (33) 

A= ( ^2(w b (t,u) - w b (t))e it ) du 
J° t=i 

and Rn = N / ( u) — Wb(t)) e.f J du, 



and we show that asymptotically, ^ A is normally distributed and R^ is negligible. 
First, define 

rl T 2 

A° = / (^w 6 (t,w)e it J du. 
^° t=i 

By Theorem 1 in Zhang and Wu (2011), under the bandwidth conditions Tfe 3//2 — > oo and 



6 — > and the short-range dependence condition (19), we have 

Tb 1/2 (A° - EA°) A N(0, a 2 K* 2 ). (34) 



Observing that A\,...,A° N are i.i.d., it results from (34) and the Lindeberg-Feller CLT 
that 

Th 1 / 2 N 
ViV 

We now show that Tb 1 / 2 ^ 1 / 2 J2f=x( A i ~ A i) is negligible as N,T -»■ oo. Let 



Jj = y^w 6 (t)e a and jj = y^w 6 (t)e^, where w 6 (t) = / 
t=i t=i ■ /o 



Noting that max^ = 0(T *) and max t |wft(t)| = 0(T 1 ), one can obtain from 

Lemma 1 in Liu and Wu (2010) that || J;|| 4 = 0(T^ 2 ) and \\jA\ 4 = O^- 1 / 2 ). Hence, 

\\A°-A l \\ 2 2 = \\J 2 -2J l j l \\l = 0{T- 2 ) (36) 
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and by the i.i.d. character of the (A° — Ai), one deduces that 



N 



Tb ^ N -^J2(A--A l 



t=i 



0(b). 



(37) 



We proceed to study the remainder term (R N — E,R N ) in (33). By expanding R N and 
using the i.i.d. character of the N time series, one easily finds that 

R N L A\ + J\ - 2JJ U (38) 

where = stands for equality in distribution. The terms in the above expansion have been 



studied before. More precisely, the relations (34) and (36) yield 



\Tb 1/2 N- 1/2 (R N - ER N ) \\l = 0(N- r ) + OiN^b). 



(39) 



Putting together (33), (35), (37), and (39), one obtains the asymptotic normality (32). 



9.1.2 Case (ii): N is fixed 

Recall that e it = G(t/T; To). For Qt(u) = G(u;J-i t ), define 

Cit{u) = E(£#(li) le^t-m+l) £ i,t-m+2, • • ■ j £ i,t)- 

Then the process {Cit{u)}t& is m-dependent with long-run variance function g* converging 
to g as m — > oo. As in the proof of Theorem 1 in Zhang and Wu (2011), we introduce the 
martingale difference 

oo 

1=0 

m 

1=0 

Observe that D* t , 1 < t < T, are also m-dependent. Let D\ t = D* t — D* t , where 
D* t = Sili D* t /N; let (a*) 2 = ^(g*{u)) 2 du. By the argument of Theorem 1 in Zhang 



and Wu (2011), to derive the asymptotic normality (20), it suffices to show that as T — > oo, 

N N 



T 2 b(N 



t-t'w* 



\<t<t'<T 



2Tb 



*\2 



(40) 



1=1 i'=l 
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Since the D[ t , i = 1, . . . , N, are Ltd., we see that E(Dt t Z>t /)t ) = g*{t){N-l)/N i£i = i' 
and E(D| t jDt, ( ) = —g*(t)/N if i ^ z'. With a few manipulations, we then obtain 

AT JV 
i=l i'=l 

Furthermore, with the continuity of g>*, classic arguments for kernel smoothing show that 

't-t'\\ 2 

T 2 6 



4s E (^(w))V(w=w J +°(i)- 



l<i<t'<T 



Hence (40) is proved and the asymptotic normality (20) follows. □ 



9.2 Proof of Theorem H 



By (33) and the Cauchy-Schwarz inequality, we can write 



N / N 1/2 \ 

An,t = In,t + (X)^ - r n) + ^ P 4(t(E^ - ^) 

8=1 \ 1=1 / 



(41) 



where 



N rl 



= J2 S { J2 (Mt,u) - w b (t)) (ja(t/T) - v(t/T)) Y<iu 
i=i Jo t=1 

and /i = iV -1 J^ili ^i- By the approximation properties of local linear smoothers (see for 
example Proposition 1.13 in p. 39 of Tsybakov (2009)), we obtain 

I N , T = A N + 0(N(b 2 + T- 1 )), (42) 

provided that the i — 1, . . . , N, have uniformly bounded second derivatives on [0, 1]. 



On the other hand, we know from (35), (37), and (39) that 

N 



i=l 



(43) 



By the stochastic Lipschitz continuity (14) and the short-range dependence condition (19), 



and by properties of weight functions of local linear smoothers (see Lemma 1.3 in p. 38 of 
Tsybakov (2009)) we also have EA { = 0((Tb)- 1 ). Hence, 



A N)T = A N + (Nb 2 ) + O {N/Tb) + o P (N/Tb) . 



(44) 
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If N 1 Ajv converges to at a rate slower than b 2 + 1/Tb, then N 1 (6 2 + 1/T6)A./v,t oo 
in probability. Hence the test based on A N)T has unit asymptotic power. □ 

9.3 Proof of Theorem M 

Let gi{u) = 7ifc( u ) be the estimated long-run variance based on {e it }J =1 , where 

K T = [Trg\ is the truncation order and j ik = \ Mt ^_ w J2t,t+keM T (u) e itZi,t+k is the sample 
autocovariance at lag k. Since the cardinality |iV T (w)| is of order Tt, one sees that 

9i(u) = \ t/? \\ e it e it'K{\t-t>\<K T } ■ (45) 

By the argument of Proposition 1 in Liu and Wu (2010), it can be shown that sup ue r ji || <7i(w) 
E^(w) || 2 = O(yfo) and by the i.i.d. property of the {ei t }f =1 , one deduces that 



sup \\g{u)-Eg{u)\\ 2 = 0{y^/N). (46) 
we [0,1] 

The expectation E^(u) can be used to approximate the truncation of g(u) to order 



Kt thanks to the stochastic Lipschitz continuity (14) and the martingale decomposition 
of Wu (2007). Specifically, let T 2 (k) = Y^LaHfiHi + Then for a11 u e [0, 1] and 
t, t' e M T (u) such that \t-t'\ < K T , it holds that 

|E(e^) - 7M1 (t/T)| < C(T 2 (\t-t'\) A (rg)). (47) 

Moreover, we obtain after easy calculation that 

oo 

^(r 2 (fc) A (rg)) = 0((rg) a ^). (48) 

fe=0 



Taking the expectation in (45) and adding terms so that the summation index set is 
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{(t, t') :te Mr(u), 1 < t' < T, \t - t'\ < K T }, it stems from (prb and Q that 



T 

Wg i (u)= l + ^ ) [ ^n^w)Ht-t'\<K T } + 0{K T )\ 



\ JVt[U >\ teXr(u)k=-K T 

jj^jj E E 7*«/T) +0 (M 



+ 0(0) 



ttN T {u) k=-K T 



oo 

E {?{t/T)-2 E 7,(t/T))+0((r^r/( 1 ^) + 0(^) 



(49) 



teA/VW 



k=K T 



In (49), a Taylor expansion of g e C [0, 1] at order 2 yields 



sup 

ue[o,i] 



0(r 2 ^ 



(50) 



Also, the martingale decomposition of Wu (2007) can be applied to show that sup u6 r ]i |7fc(it)| < 
T 2 (k), so that under the assumptions of Theorem 2, 



OO r OO \ 

sup \lM\ = 0[ E S 2 (k))=0{(Trg)~ a ). 
«e[o,l] k = KT \ k=KT J 



(51) 



Finally, to obtain (28), it suffices to note that Eg (it) = E^(it) 



To derive (29), an easy calculation shows that 
2 



l^r(«)l 



E E (^** _ e ^) e **'^{i<-t'i<^T} 



teJVV(it) t'eJV T (u) 



77TT E E _ O^if -eif)%-f |<JC 2 



+ LA/" T (u)| 
InAu) + II N t(u). 



Noticing that e it - e it = 0i(t/T) - Yh»=i Pi(t"/T)w b (t",t) - Yh»=i e it »w b (t" ,t), we have 

max ||ei t - e ft || p < C(b 2 + T' 1 ' 2 ^ 1 ' 2 ). 

t=l,...,T 

Hence by Lemma 1 in Zhang and Wu (2011), we have sup ug r 01 i ||/jv,t(w)|| p < Cl and 



su P«e[o,i] ||-^at,t(w)||p < Cl 2 , which proves (29). □ 
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Examples of time series 
and global trend 



Long-run variance function 
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Figure 2: Left panel: examples of download volume time series (daily-totaled and log- 
transformed) and estimated global trend in thick line. Right panel: Estimated long-run 
variance function. 
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Figure 3: Autocorrelation functions of the residual time series after a local linear fit with 
bandwidth b = 10 days. 
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Figure 4: Trends in the clusters of size s > 10. After a log-transform of the data, 
trends are obtained by smoothing the time series with a bandwidth 6 = 4 days. 
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